Stellar systems in blue, confirmed exoplanetary systems in red
Introduction
¶Exoplanet detection is a critical part of modern astronomy and astrophysics. Currently there are around 5,500 confirmed exoplanets and around 7,000 candidate exoplanets. The main method used to discover exoplanets is the transit of a planet across its star; however, this occurrence is extremely unlikely, as it requires the orbital plane of the planetary system to be nearly edge-on as observed by telescopes in the solar system. Furthermore, the transit itself takes up only a very small fraction of the planet's orbital period. Consequently, even if an eligible planetary system is being observed, it is unlikely that a transit will be detected, let alone the multiple transits at regular intervals that are required to confirm an exoplanet's existence. Although the transit method accounts for the large majority of exoplanet discoveries, these limitations mean that nearly all exoplanets in surrounding planetary systems go undiscovered.
Understanding how many exoplanets should exist in a given system is extremely important to astronomical research: it gives researchers a better understanding of the factors that play into exoplanet formation, and a deeper understanding of our universe and of potential extraterrestrial life.
This project's goal is to create a machine learning model that predicts the existence, and if so the number, of exoplanets in a given stellar system. We can do this under the assumption that similar stellar systems form similar numbers of planets. For example, if a stellar system is older, it could be more likely to have exoplanets. We will test this hypothesis, and examine many other variables, in the steps outlined below.
Data Collection and Cleaning
¶We will be using Python, along with Jupyter Notebooks, to develop and visualize our code. We will begin by collecting and cleaning the data for this project. First, let's import the necessary libraries. We will be using: numpy, pandas, matplotlib, seaborn, scipy, and scikit-learn.
#imports
#standard libraries for data science
import numpy as np
import pandas as pd
#python visualization/plotting libraries
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import seaborn as sns
#regression/machine learning libraries
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
Downloading/Importing Data
¶For this project, the data sources we are using are:
NASA's Exoplanet Archive - for exoplanetary data
Henry Draper (HD) Catalogue - for stellar data
After downloading these datasets, we will begin to process them. We will convert the Exoplanet Archive dataset to a pandas dataframe, remove the redundant "HD" prefix from its 'hd_name' column, and generally clean it up, which will allow us to cross-reference the data against the Henry Draper catalogue.
#data processing
#csv to pandas
exoDf = pd.read_csv('ExoplanetData.csv')
#remove the redundant "HD" prefix (and any component letter like "A") from the hd_name column,
#keeping just the catalogue number
exoDf['hd_name'] = exoDf['hd_name'].str.extract(r'(\d+)', expand=False)
exoDf['hd_name'] = pd.to_numeric(exoDf['hd_name'], errors='coerce')
exoDf['hd_name'] = exoDf['hd_name'].astype('Int64')
Let's do the same with the HD Catalogue data.
#csv
hdDf = pd.read_csv('HDCatalogue.csv', sep=';')
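With both catalogues loaded, the cross-referencing mentioned earlier can be sketched as a pandas merge on the HD number. The frames and the HD Catalogue column name `HD` below are toy stand-ins for illustration; the actual column name depends on how the catalogue export was configured:

```python
import pandas as pd

# Toy stand-ins for exoDf and hdDf, just to illustrate the join on HD number.
# The 'HD' column name and the Vmag values here are illustrative assumptions.
exo_toy = pd.DataFrame({'hd_name': [209458, 189733], 'sy_pnum': [1, 1]})
hd_toy = pd.DataFrame({'HD': [209458, 189733, 1], 'Vmag': [7.65, 7.67, 7.40]})

# Left merge: keep every exoplanet row, attach stellar data where the HD id matches.
merged = exo_toy.merge(hd_toy, left_on='hd_name', right_on='HD', how='left')
print(merged[['hd_name', 'sy_pnum', 'Vmag']])
```

A left merge keeps every exoplanet row even when the host star is missing from the HD Catalogue, which makes it easy to spot unmatched systems afterwards.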
Currently there are 121 different columns; let's reduce this to a usable subset. We will keep age, mass, name, and anything else that can be applied to regression, leaving out unneeded data like flags, discovery methods, and so on. The variables we have chosen to keep, although all might not be necessary, are:
pl_name - The name of the exoplanet
hostname - The name of the host star
hd_name - The Henry Draper (HD) catalogue ID
sy_snum - The number of stars in the system
sy_pnum - The number of planets in the system
sy_mnum - The number of moons of the planet
st_spectype - The spectral type of the host star
st_teff - The effective temperature of the host star (K)
st_rad - The radius of the host star (solar radii)
st_mass - The mass of the host star (solar masses)
st_logg - The surface gravity of the host star (log10(cm/s^2))
st_age - The age of the host star (billions of years)
st_dens - The density of the host star (g/cm^3)
ra - Right ascension (degrees)
dec - Declination (degrees)
sy_dist - Distance (parsecs)
Note that we are keeping a lot of data about the star in each system since, as mentioned before, we are going to rely on information about the star to make predictions about the number of exoplanets in that system.
columns_to_keep = ['pl_name', 'hostname', 'hd_name', 'sy_snum', 'sy_pnum',
'sy_mnum', 'st_spectype', 'st_teff', 'st_rad', 'st_mass',
'st_logg', 'st_age', 'st_dens', 'ra', 'dec', 'sy_dist',]
exoDf = exoDf[columns_to_keep]
We have a very small amount of NA/missing exoplanet data, so we will use mean imputation to fill it in.
columns_to_impute = ['st_age', 'st_mass', 'st_dens', 'st_rad', 'st_teff', 'st_logg', 'sy_dist']
for column in columns_to_impute:
    exoDf[column] = exoDf[column].fillna(exoDf[column].mean())
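As a quick sanity check of the imputation step, here is the same pattern on a toy column: the single missing value should be replaced by the mean of the observed values, and no NaNs should remain.

```python
import pandas as pd

# Toy check of the mean-imputation loop: the NaN is skipped when computing
# the mean, so the missing entry becomes mean([2.0, 4.0]) = 3.0.
toy = pd.DataFrame({'st_age': [2.0, 4.0, None]})
toy['st_age'] = toy['st_age'].fillna(toy['st_age'].mean())
print(toy['st_age'].tolist())  # [2.0, 4.0, 3.0]
```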
Data Visualization/Exploration
¶Now that we have our data cleaned and properly stored in our two pandas dataframes, exoDf and hdDf, let's start with some exploratory data visualization to understand our data better. We will begin by plotting all of the stars in the Henry Draper Catalogue alongside all of the stars with confirmed exoplanets. This will let us see whether there is any selection bias in where the exoplanet discoveries have been made, and will also make for a neat visualization. We can do this using the "right ascension" and "declination" values of our systems, which are the astrometric standard for stellar coordinates.
#converts a degree in range 0-360 to a radian in range -pi to pi
def deg_from_neg_pi_to_pi(deg):
    return (deg * np.pi / 180) - np.pi

#converts a degree in range -90 to 90 to a radian in range -pi/2 to pi/2
def deg_to_rad_90(deg):
    return deg * np.pi / 180
raHd = hdDf['_RAJ2000'].apply(deg_from_neg_pi_to_pi)
raExo = exoDf['ra'].apply(deg_from_neg_pi_to_pi)
decHd = hdDf['_DEJ2000'].apply(deg_to_rad_90)
decExo = exoDf['dec'].apply(deg_to_rad_90)
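As a quick sanity check, the two helpers (repeated here so the block runs standalone) should map the RA endpoints 0° and 360° to -π and +π, and the declination endpoints ±90° to ±π/2, which is exactly the domain matplotlib's Aitoff projection expects:

```python
import numpy as np

# The two conversion helpers from above, repeated so this check is standalone.
def deg_from_neg_pi_to_pi(deg):
    return (deg * np.pi / 180) - np.pi

def deg_to_rad_90(deg):
    return deg * np.pi / 180

# RA: 0-360 degrees maps onto [-pi, pi]; Dec: +/-90 degrees onto [-pi/2, pi/2].
assert np.isclose(deg_from_neg_pi_to_pi(0), -np.pi)
assert np.isclose(deg_from_neg_pi_to_pi(360), np.pi)
assert np.isclose(deg_from_neg_pi_to_pi(180), 0.0)
assert np.isclose(deg_to_rad_90(90), np.pi / 2)
assert np.isclose(deg_to_rad_90(-90), -np.pi / 2)
```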
# Set a higher DPI for better resolution
fig = plt.figure(figsize=(15, 10), dpi=500)
ax = plt.subplot(111, projection='aitoff')
# Initialize the plot with initial data
sc_hd = ax.scatter(raHd, decHd, s=1, c='blue', alpha=0.1)
sc_exo = ax.scatter(raExo, decExo, s=1, c='red', alpha=0.3)
ax.set_xlabel('Right Ascension')
ax.set_ylabel('Declination')
ax.set_title('Systems with Confirmed Exoplanets')
ax.grid(True)
ax.set_facecolor('white')
# Update function for animation
def update(frame):
    global sc_hd, sc_exo
    # Shift in radians
    shift_radians = np.deg2rad(frame) % (2 * np.pi)
    # Update the RA for HD stars
    new_ra_hd = ((raHd + shift_radians + np.pi) % (2 * np.pi)) - np.pi
    # Update the RA for exoplanets
    new_ra_exo = ((raExo + shift_radians + np.pi) % (2 * np.pi)) - np.pi
    # Update the scatter plot data
    sc_hd.set_offsets(np.column_stack((new_ra_hd, decHd)))
    sc_exo.set_offsets(np.column_stack((new_ra_exo, decExo)))
    return sc_hd, sc_exo
#uncomment if you want to make the animation!
#ani = animation.FuncAnimation(fig, update, frames=np.arange(0, 360, 3), interval = 100, blit=True, repeat=True)
#ani.save('stars_animation.gif', writer='ffmpeg', dpi=200)
plt.show()